Evaluating Concrete Strength Model Performance

Using Cross Validation Methods

Sai Devarasheyyt, Mattick, Musson, Perez

2024-07-26

Introduction to Cross-Validation

  • Measure performance and generalizability of machine learning and predictive models.
  • Compare different models constructed from the same data set.

Cross-validation (CV) is widely used in various fields, including:

  • Machine Learning
  • Data Mining
  • Bioinformatics

It is used to:

  • Minimize overfitting
  • Ensure a model generalizes to unseen data
  • Tune hyperparameters

Definitions

Generalizability:
How well predictive models created from a sample fit other samples from the same population.

Overfitting:
When a model fits the training data too closely, capturing more than the underlying patterns.

The model fits characteristics specific to the training set:

  • Noise
  • Random fluctuations
  • Outliers

Hyperparameters:
Model configuration variables, for example:

  • Number of nodes and layers in a neural network
  • Number of branches in a decision tree

Process Steps

  1. Prepare the data
    • Subset the data randomly and without replacement into K equally sized folds.
  2. Split the folds into test and training sets
    • 1 test fold and K-1 training folds.
  3. Fit the model to the training data and evaluate it on the test fold
  4. Repeat steps 2 - 3 until every fold has been the test fold
  5. Calculate the mean error

Cross-Validation Options

  1. K-Fold Cross-validation (K-Fold):
    For the K-Fold Cross-validation (K-Fold) method, the dataset is divided into K subsets, known as “folds”. Each fold contains roughly an equal number of observations. The model is trained on K−1 folds and tested on the remaining holdout fold. This process is repeated K times so that every fold is used as the test set once (Figure 1). The K-Fold method is known for its flexibility and adaptability to various data distributions and sample sizes (Browne 2000). Many authors, including James et al. (2013) and Gorriz et al. (2024), describe a bias-variance trade-off when comparing K-Fold and Leave-One-Out Cross-Validation (LOOCV). The computational cost, bias, and variance of the two methods differ. Despite the lower bias of LOOCV, K-Fold with K = 5 or K = 10 is recommended over LOOCV because:
    • the computational cost is much lower,
    • it does not show excessive bias,
    • and it does not show excessively high variance.
Figure 1: K-Fold Cross-Validation where K = 5. Created by Author
  2. Leave-One-Out Cross-Validation (LOOCV):
    Leave-One-Out Cross-Validation is a special case of K-Fold Cross-Validation in which K equals the number of observations, n. With this method, the model is trained n times, where each training set has n−1 observations and the test set has 1 observation (Figure 2). This approach is particularly useful for smaller datasets for which separate training and testing sets are impractical (Wong and Yeh 2019). Iterating through all n possible training sets of n−1 observations yields a performance estimate with low bias (Yates et al. 2023). One drawback of this method is the high computational cost associated with fitting the model n times. When n is large or the model is complex, LOOCV can be infeasible (Adin et al. 2024). When n is small or computational cost is irrelevant, LOOCV gives a thorough, low-bias evaluation of model performance (Lei 2020; Hawkins, Basak, and Mills 2003). The other drawback of LOOCV is high variance in the error estimate, brought on by the repeated use of almost identical training sets (Browne 2000).
Figure 2: Leave-one-out Cross-validation. Created by Author
  3. Nested Cross-Validation:
    Nested Cross-Validation builds on K-Fold Cross-Validation. The data are divided into K folds of similar size; in each iteration, K−1 folds form the training set and the remaining fold is the test set. The difference between Nested and K-Fold lies in the use of two loops. The inner loop works within the training folds to optimize hyperparameters, and the outer loop uses the test fold to assess the model’s performance (Figure 3).

    Nested cross-validation offers a more objective estimate of the model’s capacity for generalization by isolating hyperparameter tuning from performance assessment. It is especially helpful when choosing a model and adjusting hyperparameters (Bradshaw et al. 2023). Recently, recommendations to use this method when n is small have become more common (Raschka 2018). According to Filzmoser, Liebmann, and Varmuza (2009), nested cross-validation is useful for generating precise performance estimates without the overfitting that can happen when hyperparameters are tuned on the same data used for model evaluation. This technique is particularly important in complicated modeling situations where the model’s prediction performance is greatly affected by parameter tuning.
Figure 3: Nested Cross-validation where K = 5. Created by Author

Study Objectives

The goal of this paper is to explore the methodology of cross-validation and its application in evaluating the performance of predictive models for concrete strength. Concrete strength is a crucial parameter in construction, directly impacting the safety, durability, and cost-effectiveness of structures. Accurate prediction of concrete strength allows for optimal design, better resource allocation, and improved construction practices. Traditional methods of model validation, such as holdout validation, can sometimes provide misleading performance estimates due to their reliance on a single training-validation split. Cross-validation addresses this limitation by using multiple splits, thus providing a more robust evaluation of the model. As previously mentioned, cross-validation is a statistical technique used to assess the generalizability and reliability of a model by partitioning the data into multiple subsets, training the model on some subsets while validating it on others. This process helps prevent overfitting, ensuring that the model performs well on new, unseen data, and provides a more accurate estimate of the model’s performance.

In this study, we apply cross-validation to a dataset containing measurements of concrete strength. We aim to demonstrate how different cross-validation techniques, such as k-fold cross-validation and leave-one-out cross-validation, can be used to evaluate the performance of predictive models. By comparing these techniques, we seek to identify the most effective method for assessing model accuracy and reliability in the context of predicting concrete strength.

Methods

Model Measures of Error

Measuring the quality of fit of a regression model is an important step in data modeling. Several metrics are commonly used to quantify how well a model explains the data. By measuring the quality of fit, we can select the model that makes the most accurate predictions on unseen data. Common metrics used to measure model performance are:

  • Mean Absolute Error (MAE)

The Mean Absolute Error is a measure of error magnitude. The sign of the error does not matter because MAE uses the absolute value. Smaller MAE values (lower error magnitude) indicate better model fit. MAE is calculated with equation (1) by averaging the absolute differences between the observed \(y_i\) and predicted \(\hat{f}(x_i)\) values:

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{f}(x_i)| \tag{1} \]

where:

  • \(n\) is the number of observations,
  • \(\hat{f}(x_i)\) is the prediction that the regression function \(\hat{f}\) gives for the ith observation,
  • \(y_i\) is the observed value.

  • Root Mean Squared Error (RMSE)

The Root Mean Squared Error (2) is also a measure of error magnitude. Like MAE, smaller RMSE values indicate better model fit. This metric uses the squared errors \((y_i - \hat{f}(x_i))^2\). Squaring the errors gives more weight to the larger ones. In contrast, MAE uses the absolute errors \(|y_i - \hat{f}(x_i)|\), so each error contributes in proportion to its size. Taking the square root returns the error to the same units as the response variable, making it easier to interpret.

\[ \text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{f}(x_i))^2} \tag{2} \]

  • R-squared (\(R^2\))

R-squared is the proportion of the variance in the response variable that is explained by the predictor variable(s). Unlike MAE and RMSE, \(R^2\) values typically range from 0 to 1, and the higher the value, the better the fit. An \(R^2\) value of 0.75 indicates that 75% of the variance in the response variable can be explained by the predictor variable(s). The \(R^2\) equation (3) is composed of two key parts, the Total Sum of Squares (\(SS_{tot}\)) and the Residual Sum of Squares (\(SS_{res}\)), where \(\bar{y}\) is the mean of the observed values.

\[ R^2 = \frac{SS_{tot}-SS_{res}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{f}(x_i))^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2} \tag{3} \]

(James et al. 2013), (Hawkins, Basak, and Mills 2003), (Helsel and Hirsch 1993)
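
The three measures of error can be sketched directly from equations (1) to (3). This is a minimal illustration in plain Python, not the code used in the study; the function names and the four observed and predicted strengths are hypothetical values chosen for easy checking.

```python
import math

def mae(y, y_hat):
    """Mean absolute error, equation (1)."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error, equation (2)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)) / len(y))

def r_squared(y, y_hat):
    """R-squared, equation (3): 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Hypothetical strengths in MPa, for illustration only
observed = [30.0, 40.0, 50.0, 60.0]
predicted = [32.0, 38.0, 53.0, 57.0]

print(round(mae(observed, predicted), 3))        # 2.5
print(round(rmse(observed, predicted), 3))       # 2.55
print(round(r_squared(observed, predicted), 3))  # 0.948
```

Because the two errors of size 3 are squared before averaging, the RMSE (2.55) comes out slightly larger than the MAE (2.5) on the same predictions.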

K-Fold Cross-Validation

\[ CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} \text{Measure of Error}_i \tag{4} \]

Process:

  1. Prepare the data
    Subset the data randomly and without replacement into K equally sized folds. Each fold will contain approximately n/K observations. For example, when n = 200 and K = 5, each fold has 200/5 = 40 observations. If n = 201, one of the folds has 41 observations and the other four folds have 40.

  2. Split the folds into test and training sets
    Continuing the example with 5 folds, we could choose the first fold to be the test set, and the other 4 would be the training set. It doesn’t matter which fold is chosen first, because each fold will eventually serve as the test fold against the other 4 (Figure 1).

  3. Fit the model to the training data and evaluate it on the test fold
    Take the model you are going to use for prediction and fit it to the training data. Continuing with our example, you would use the 4 training folds to fit the model. Then apply the fitted model to the 1 test fold and determine its accuracy by comparing the predictions with the actual values from the test fold.

  4. Repeat steps 2 - 3
    In the example with K = 5, pick one of the folds that has not yet been used as the test fold, make it the test fold, and use the other 4 as the training folds. In this way, every observation is a member of the test fold once and of the training folds 4 times.

  5. Calculate the mean error
    Measure the error each time a fold is used as the test fold, then take the mean of the measured errors over all K folds, as in equation (4).
    (Song, Tang, and Wee 2021)
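
The steps above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the study's implementation: the helper names (`make_folds`, `fit_line`, `k_fold_rmse`), the one-predictor least-squares model, and the synthetic data are all hypothetical, and RMSE stands in for the measure of error in equation (4).

```python
import math
import random

def make_folds(n, k, seed=0):
    """Randomly split indices 0..n-1, without replacement, into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def k_fold_rmse(xs, ys, k=5, seed=0):
    """Return CV_(k): the mean of the k test-fold RMSE values."""
    folds = make_folds(len(xs), k, seed)
    fold_errors = []
    for test in folds:  # each fold serves as the test fold exactly once
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq = [(ys[i] - (a + b * xs[i])) ** 2 for i in test]
        fold_errors.append(math.sqrt(sum(sq) / len(sq)))
    return sum(fold_errors) / k

# Synthetic data for illustration only (not the concrete data set)
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(201)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1) for xi in x]

print(sorted(len(f) for f in make_folds(201, 5)))  # [40, 40, 40, 40, 41]
print(round(k_fold_rmse(x, y, k=5), 2))
```

With n = 201 and K = 5 the fold sizes match the example in step 1: one fold of 41 observations and four folds of 40.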

Leave-One-Out Cross-Validation (LOOCV)

\[ CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n} \text{Measure of Error}_i \tag{5} \]

Process: The steps for LOOCV are almost identical to those for K-fold cross-validation. The only difference is the number of folds: in K-fold, K is less than the number of observations (n), whereas in LOOCV, K = n. When the data are split into testing and training sets, each test fold is a single observation and the training data are all the other observations (Figure 2). In this way, every observation is predicted once by a model fitted to the other n−1 observations, and the process is repeated n times (James et al. 2013).
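
A minimal sketch of the LOOCV loop follows, assuming a one-predictor least-squares line as the model and absolute error as the measure of error in equation (5). The helper names and the six (x, y) pairs are hypothetical and serve only to show the mechanics.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def loocv_predictions(xs, ys):
    """For each i, fit on the other n-1 observations and predict observation i."""
    preds = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(a + b * xs[i])
    return preds

# Hypothetical (x, y) pairs for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

preds = loocv_predictions(x, y)
abs_errors = [abs(yi - pi) for yi, pi in zip(y, preds)]
cv_n = sum(abs_errors) / len(abs_errors)  # CV_(n) with absolute error

print(len(preds))  # 6: one model fit per observation
print(round(cv_n, 3))
```

The loop makes the cost visible: the model is refit n times, which is why LOOCV becomes expensive for large data sets or slow-to-train models.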

Nested Cross-Validation

  1. Split the data into training and testing sets
    As in k-fold cross-validation, break the observations into a single test fold and the training folds. For example, if there are 300 observations and you use K = 5, each fold has 60 observations; four of the folds would be training folds and one of them would be the test fold.

  2. Define inner and outer loops
    We define the test fold as the outer loop and use it to test the performance of the model. The training folds are defined as the inner loop, and we use them to decide which hyperparameter values to use.

  3. Split the inner loop into training sets and validation sets
    The inner loop (the training folds) is split in half. One half is used for training and the other half for validation (Figure 3).

  4. Fit the model to the inner loop
    We choose a candidate set of hyperparameter values, fit the model to the inner training set, and store the accuracy it achieves on the inner validation set. We then swap the inner training and validation sets and fit the model again. The second accuracy score is averaged with the first to give the average accuracy for that candidate.

  5. Choose another set of hyperparameter values
    We then choose a different candidate and repeat step 4. After determining its average accuracy, we compare it with the average accuracy of the other candidates. The candidate with the highest average accuracy is chosen for that set of training folds, and the model fitted with it is scored on the test fold.

  6. Repeat the process K times
    After getting an accuracy score for each of the K test folds, we average the scores over all folds, which gives the average accuracy of the model (Berrar et al. 2019).
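
The nested procedure can be sketched as follows. This is an illustrative sketch only: it assumes a k-nearest-neighbours regressor whose number of neighbours is the hyperparameter being tuned, uses RMSE (lower is better) in place of an accuracy score, and runs on synthetic data. None of the names or values come from the study.

```python
import math
import random

def knn_predict(train_x, train_y, x0, k):
    """Predict at x0 as the mean response of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def rmse(xs, ys, train, test, k):
    """RMSE on the `test` indices of a k-NN model built from the `train` indices."""
    tx, ty = [xs[i] for i in train], [ys[i] for i in train]
    sq = [(ys[i] - knn_predict(tx, ty, xs[i], k)) ** 2 for i in test]
    return math.sqrt(sum(sq) / len(sq))

def nested_cv(xs, ys, candidates, outer_k=5, seed=0):
    """Return (mean outer-fold RMSE, hyperparameter chosen in each outer split)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::outer_k] for i in range(outer_k)]
    outer_errors, chosen = [], []
    for f, test in enumerate(folds):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        # Inner loop: split the training folds in half, then swap the halves
        half = len(train) // 2
        first, second = train[:half], train[half:]
        best = min(
            candidates,
            key=lambda k: (rmse(xs, ys, first, second, k)
                           + rmse(xs, ys, second, first, k)) / 2,
        )  # candidate with the lowest mean validation RMSE
        chosen.append(best)
        # Outer loop: refit on all training folds, score on the held-out test fold
        outer_errors.append(rmse(xs, ys, train, test, best))
    return sum(outer_errors) / outer_k, chosen

# Synthetic data for illustration only (not the concrete data set)
rng = random.Random(1)
x = [rng.uniform(0, 10) for _ in range(300)]
y = [35 + 10 * math.sin(xi) + rng.gauss(0, 2) for xi in x]

score, chosen = nested_cv(x, y, candidates=[1, 5, 15])
print(round(score, 2), chosen)
```

The test fold is never seen while the number of neighbours is being chosen, which is what keeps the outer estimate separate from the tuning step.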

Analysis and Results

Data Extraction, Transformation, and Visualization

I-C Yeh (1998) modeled the compressive strength of high-performance concrete (HPC) at various ages and made with different ratios of components (Table 1). The data used for that study were made publicly available and can be downloaded from the UCI Machine Learning Repository (I-Cheng Yeh 2007).

Table 1: Variables
Name                 Data Type      Units                  Role
Strength             Quantitative   MPa                    Response
Cement               Quantitative   kg per m³ of mixture   Predictor
Blast Furnace Slag   Quantitative   kg per m³ of mixture   Predictor
Fly Ash              Quantitative   kg per m³ of mixture   Predictor
Water                Quantitative   kg per m³ of mixture   Predictor
Superplasticizer     Quantitative   kg per m³ of mixture   Predictor
Coarse Aggregate     Quantitative   kg per m³ of mixture   Predictor
Fine Aggregate       Quantitative   kg per m³ of mixture   Predictor
Age                  Quantitative   days (1 to 365)        Predictor

Load Libraries and Data

Data Plots

Data Correlation Plots

The Model

\[ \widehat{\text{Strength}} = \hat{\beta}_0 + \hat{\beta}_1\,\text{Cement} + \hat{\beta}_2\,\text{Superplasticizer} + \hat{\beta}_3\,\text{Age} + \hat{\beta}_4\,\text{Water} \]

Construct the Linear Regression Model

Linear Regression: K-Fold Cross-validation

Measure of Error   Value
RMSE               12.13
MAE                 9.23
R2                  0.46

Linear Regression: Leave-one-out Cross-validation

Measure of Error   Value
RMSE               12.13
MAE                 9.23
R2                  0.46

Linear Regression Model: Nested Cross-validation

Measure of Error   Value
RMSE               11.87
MAE                 9.43
R2                  0.49

Construct LightGBM Model

LightGBM Model: K-Fold Cross-validation

Measure of Error   Value
RMSE                8.73
MAE                 6.82
R2                  0.73

LightGBM Model: Leave-one-out Cross-validation

LightGBM Model: Nested Cross-validation

Measure of Error   Value
RMSE                8.27
MAE                 6.39
R2                  0.75

Cross-validation Method Comparison

Method      RMSE    MAE    R2
Nested CV   11.87   9.43   0.49
5-Fold      12.13   9.23   0.46
LOOCV       12.13   9.23   0.46

Model Comparison

Method      Measure   MLR     LGBM
5-Fold      RMSE      12.13   8.73
5-Fold      MAE        9.23   6.82
5-Fold      R2         0.46   0.73
LOOCV       RMSE      12.13   NA
LOOCV       MAE        9.23   NA
LOOCV       R2         0.46   NA
Nested CV   RMSE      11.87   8.27
Nested CV   MAE        9.43   6.39
Nested CV   R2         0.49   0.75
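
As a worked reading of the table above, the relative reduction in RMSE from MLR to LGBM can be computed from the reported values. The percentages below are derived arithmetic, rounded to one decimal place, and are not additional results from the study.

\[ \frac{12.13 - 8.73}{12.13} \approx 28.0\% \quad \text{(5-Fold)}, \qquad \frac{11.87 - 8.27}{11.87} \approx 30.3\% \quad \text{(Nested CV)} \]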

Conclusion

In this study, we examined cross-validation techniques for evaluating the performance of concrete strength models: K-Fold, leave-one-out, and nested cross-validation. We fit a linear regression model to the data set and compared the performance estimates produced by each technique. Cross-validation estimates the generalization error of a model, which a single fit to the entire data set cannot do. For the linear regression model, K-Fold and leave-one-out cross-validation gave the same estimates (RMSE 12.13, MAE 9.23, R2 0.46), and nested cross-validation gave slightly better RMSE (11.87) and R2 (0.49) but a slightly higher MAE (9.43). The LightGBM model produced lower errors than linear regression under both K-Fold and nested cross-validation. The model used cement, superplasticizer, age, and water as predictors of concrete strength. Overall, this kind of assessment supports building predictive models with better forecast accuracy and can inform the development of concrete mixes and improvements in construction practice.

References

Adin, Aritz, Elias Teixeira Krainski, Amanda Lenzi, Zhedong Liu, Joaquín Martínez-Minaya, and Haavard Rue. 2024. “Automatic Cross-Validation in Structured Models: Is It Time to Leave Out Leave-One-Out?” Spatial Statistics, 100843.
Berrar, Daniel et al. 2019. “Cross-Validation.”
Bradshaw, Tyler J, Zachary Huemann, Junjie Hu, and Arman Rahmim. 2023. “A Guide to Cross-Validation for Artificial Intelligence in Medical Imaging.” Radiology: Artificial Intelligence 5 (4): e220232.
Browne, Michael W. 2000. “Cross-Validation Methods.” Journal of Mathematical Psychology 44 (1): 108–32.
Filzmoser, Peter, Bettina Liebmann, and Kurt Varmuza. 2009. “Repeated Double Cross Validation.” Journal of Chemometrics: A Journal of the Chemometrics Society 23 (4): 160–71.
Gorriz, Juan M, Fermín Segovia, Javier Ramirez, Andrés Ortiz, and John Suckling. 2024. “Is k-Fold Cross Validation the Best Model Selection Method for Machine Learning?” arXiv Preprint arXiv:2401.16407.
Hawkins, Douglas M, Subhash C Basak, and Denise Mills. 2003. “Assessing Model Fit by Cross-Validation.” Journal of Chemical Information and Computer Sciences 43 (2): 579–86.
Helsel, Dennis R, and Robert M Hirsch. 1993. Statistical Methods in Water Resources. Elsevier.
James, Gareth, Daniela Witten, Trevor Hastie, Robert Tibshirani, et al. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.
Lei, Jing. 2020. “Cross-Validation with Confidence.” Journal of the American Statistical Association 115 (532): 1978–97.
Raschka, Sebastian. 2018. “Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning.” arXiv Preprint arXiv:1811.12808.
Song, Q Chelsea, Chen Tang, and Serena Wee. 2021. “Making Sense of Model Generalizability: A Tutorial on Cross-Validation in R and Shiny.” Advances in Methods and Practices in Psychological Science 4 (1): 2515245920947067.
Wong, Tzu-Tsung, and Po-Yang Yeh. 2019. “Reliable Accuracy Estimates from k-Fold Cross Validation.” IEEE Transactions on Knowledge and Data Engineering 32 (8): 1586–94.
Yates, Luke A, Zach Aandahl, Shane A Richards, and Barry W Brook. 2023. “Cross Validation for Model Selection: A Review with Examples from Ecology.” Ecological Monographs 93 (1): e1557.
Yeh, I-C. 1998. “Modeling of Strength of High-Performance Concrete Using Artificial Neural Networks.” Cement and Concrete Research 28 (12): 1797–1808.
Yeh, I-Cheng. 2007. “Concrete Compressive Strength.” UCI Machine Learning Repository.